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1  Introduction. 


In  this  paper  we  study  the  recent  push-relabel  [7,  111  class  of  algorithms  for  the  maximum  flow 
problem  in  capacitated  directed  networks.  Several  researchers  [1,  3,  5,  12]  concluded  that  their 
sequential  implementations  of  push-relabel  algorithms  are  superior  to  previously  used  codes.  We 
focus  on  the  behavior  of  the  push-relabel  method  in  massively  parallel  environments. 

The  push-relabel  method  is  the  first  theoretically  efficient  algorithm  for  the  TnaYiTmim  flow 
problem  that  is  potentially  practical;  the  previous  algorithm  of  Shiloach  and  Vishkin  [18],  based 
on  Dinitz’  blocking  flow  method  [6],  requires  the  amount  of  memory  quadratic  in  the  number  of 
nodes  in  the  input  network. 

The  push-relabel  method  uses  two  operations,  push  and  relabel,  to  manipulate  current  flow  and 
distance  labeling  functions.  In  the  parallel  implementation  the  algorithm  pushes  simultaneouslv 
from  all  nodes  to  which  the  push  operation  applies;  we  refer  to  this  method  of  pushing  as  parallel 
push.  Similarly,  parallel  relabel  applies  relabeling  operation  to  all  eligible  nodes  simultaneously. 

The  push-relabel  algorithm  works  regardless  of  the  order  in  which  the  push  and  relabel  oper¬ 
ations  are  applied.  It  also  works  with  any  correct  labeling  procedure.  This  robust  nature  of  the 
algorithm  allows  us  to  exp erunent  with  various  strategies  for  ordering  push  and  relabel  operations, 
and  with  a  variety  of  relabeling  procedures.  AH  of  the  push  and  relabel  operations  are  suitable 
for  parallelization.  The  goal  is  to  determine  the  ordering  and  relabel  procedures  most  suitable  for 
a  practical  parallel  implementation. 

This  paper  is  a  report  on  experiments  with  various  strategies  for  the  parallel  push-relabel 
algorithm.  The  experiments  were  conducted  on  a  Connection  Machine  [13]  model  CM-2,  with 
32K  processors  (K=1024).^  Each  processor  can  directly  access  256K  bits  of  local  memory,  so  the 
entire  system  has  1  gigab3rtes  of  memory.  The  starting  point  for  our  work  was  the  Connection 
Machine  implementation  of  the  algorithm  described  in  [8].  Our  current  implementation,  however, 
improves  the  original  one  in  several  areas,  including  better  ordering  and  pipelining  of  operations 
of  messages.  Also,  the  original  implementation  was  on  CM-1  using  the  *LISP  interpreter  where 
as  the  current  implementation  is  on  CM-2  using  the  C*  compiler.  Our  experimental  results  are 
very  good  on  a  wide  class  of  problems;  we  can  solve  large  (milhons  of  arcs)  problems  in  a  matter 
of  minutes. 

This  paper  contains  five  sections  including  the  introduction.  In  Section  2  we  outline  the  push- 
relabel  algorithm  and  discuss  parallel  implementation  of  push  and  various  relabel  operations 
including  Derigs  and  Meier’s  gap-relabehng  and  parallel  breadth-first  search.  In  Section  3  we 
describe  the  mapping  of  networks  to  the  Connection  Machine  processors.  This  mapping  is  designed 
to  take  advantage  of  the  parallel  prefix  and  suflBx  operations  provided  in  the  Connection  Machine. 
In  Section  4  we  describe  in  more  detail  parallel  push  and  relabel  operations  in  the  context  of  this 
mapping.  In  Section  5  we  report  on  the  running  time  of  the  program  on  various  Sctmple  inputs 
and  present  our  conclusions. 


’We  used  16K  processors  in  our  experiments  because  full  32K  processors  were  often  unavailable. 
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2  The  Parallel  Push-Relabel  Algorithm. 


In  this  section  we  review  some  of  the  basic  concepts  of  the  push-reiabel  method  and  its  parallel 
implementation.  We  assume  that  the  reader  is  familiar  with  [11].  (See  also  [10].)  Because  of 
high-grain  parahehsm  of  the  CM- 2  architecture,  the  goal  of  our  implementation  is  to  get  a  tight 
inner  loop  of  the  algorithm  rather  then  to  achieve  high  processor  utihzation  by  careful  processor 
scheduling  (see  [9,  18]). 

A  flow  network  is  a  directed  graph  G  =  {V^  E,  5,  t,  u),  where  V  and  E  are  node  set  and  arc  set, 
respectively,  s  and  t  are  the  source  and  the  sink,  respectively,  and  u  is  a  non-negative  capacity 
function  on  the  arcs.  We  define  n  =  \V\  and  m  =  |£^|,  and  assume  that  for  each  arc  (r,ii?),  the 
arc  (u:,t;)  is  also  present.  A  flow  is  a  function  on  the  arcs  that  satisfies  capacity  constraints  on  aU 
arcs  and  conservation  constraints  on  aU  nodes  except  the  source  cind  the  sink.  The  conservation 
constraint  at  a  node  v  indicates  that  the  excess  e/(i;),  defined  as  the  difference  between  the 
incoming  and  the  outgoing  flows,  is  equal  to  zero.  A  preflow  [14]  satisfies  the  capacity  constraints 
and  the  relaxed  version  of  conservation  constraints  that  requires  the  excesses  to  be  nonnegative. 

An  arc  is  residual  if  the  flow  on  it  can  be  increased  without  violating  the  capacity  constraints, 
and  saturated  otherwise.  The  residual  capacity  Uf{v,w)  of  an  arc  (t;,it;)  is  the  amount  by  which 
the  arc  flow  can  be  increased.  The  residual  graph  is  induced  by  the  residual  arcs.  A  residual  arc 
(t;,u;)  is  admissible  if  d{v)  >  d{w). 

The  distance  labeling  d  :  F  — >  N  satisfies  the  following  conditions:  d{s)  ^  n,  d{t)  =  0,  and  for 
every  residual  arc  (i;,x£;),  d{v)  <  d{w)  -f  1.  It  is  easy  to  see  that  if  d{v)  <  n,  then  d{v)  is  a  lower 
bound  on  the  actual  distance  from  t;  to  t  in  the  residual  graph,  and  if  d{v)  >  n,  then  d{v)  -  n  is 
a  lower  bound  on  the  actual  distance  of  t;  to  s. 

A  push-relabel  algorithm  maintains  a  preflow  and  a  distance  labeling.  We  say  that  a  node  v 
is  active  if  v  ^  ^/(^)  >  0-  The  preflow  is  modified  using  push  operation,  which  pushes 

excess  flow  from  an  active  node  to  an  adjacent  one  that  has  a  smaller  distance  label.  When  an 
active  node  cannot  push  its  excess  because  its  label  is  at  most  that  of  its  neighbors,  then  the 
distance  labeling  is  modified  using  a  relabel  operation,  which  increases  the  distance  label  of  the 
node  to  the  largest  value  allowed  by  the  labeling  constraints. 

Our  parallel  implementation  works  as  follows.  The  preflow  f  is  initialized  to  f{s,w)  =  u(s,t£;) 
for  all  nodes  u  adjacent  to  s,  and  f{v,w)  =  0  for  aU  t;  7^  s.  The  initial  distance  labeling  is 
computed  using  breadth-first  search  (BFS)  on  the  residual  graph  induced  by  the  initial  preflow. 
The  prefiow  and  distance  labeling  are  updated  using  the  push  and  relabel  operations,  respectively, 
until  no  active  nodes  remzdn. 

The  parallel  push  operation  works  as  follows.  For  every  active  node,  the  parallel  push  operation 
distributes  their  excess  among  admissible  outgoing  arcs  as  described  in  Figure  1.  In  the  figure, 


Operation  Parallel-Push 

For  all  active  nodes  v  in  parallel  do 

push  flow  fiom  V  until  ey(v)  =  0  or  Vu;  such  that  d{w)  <  d{v)^  Uf{v^w)  =  0. 


Figure  1:  The  parallel  push  operation. 
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Operation  Relabel 


For  all  nodes  v  g  {5,^}  in  parallel  do 
d'{v)  ^  min{d(it;)  -r  1  :  Uf{v,w)  >  0}; 
If  d{v)  =£  d'{v)  then 

broadcast  d{v)  to  all  neighbors  of  V] 
end; 


Figure  2:  The  parallel  relabel  operation. 


Operation  gap-relabel 

(1)  Find  the  smallest  g  where  0  <  ^  <  n  and  no  node  has  label  g. 

(2)  For  all  nodes  v  in  parallel  do: 

if  ^  <  d{v)  <  n  set  d(t;)  n. 

(3)  For  all  v  which  have  a  new  label  broadcast  the  new  labels  to  all  neighbors  of  v. 

Figure  3:  The  gap-relabel  operation. 

“push  flow  from  v  to  ii;”  means  increasing  the  flow  from  r  to  ir  by  the  minimum  of  e/(t;)  and 
Uf{w)  and  updating  the  excesses  of  v  and  w\ 

We  use  several  kinds  of  parallel  relabel  operations  in  our  implementation.  The  simple  parallel 
relabel,  described  in  Figure  2,  sets  a  distance  label  of  every  node  except  s  and  t  to  one  plus  the 
minimum  distance  label  of  its  residual  neighbors. 

Another  relabeling  procedure  is  the  gap-label  of  Derigs  and  Meier  [5]  based  on  the  following 
observation.  Suppose  at  certain  stage  of  the  algorithm  some  nodes  are  labeled  0,  some  labeled  1, 
and  so  on,  through  5  —  1  <  n,  but  no  node  is  labeled  g,  although  some  other  nodes  are  labeled 
by  integers  between  g  and  n.  Derigs  cmd  Meier  observe  that  the  sink  is  not  reachable  from  any  of 
the  nodes  whose  labels  d  are  strictly  between  g  and  n.  Therefore,  the  labels  of  such  nodes  may 
just  be  increased  to  n.  This  procedure  may  easily  be  implemented  in  parallel;  see  Figure  3. 

Note  that  the  most  accurate  labeling  is  obtained  by  applying  BFS  backwards  from  the  sink  and 
forward  from  the  source.  This  can  be  implemented  by  applying  parallel  relabeling  operations  xmtil 
distance  labels  stop  changing-  Doing  this  during  every  pulse  is  too  expensive.  The  implementation 
of  [8]  performs  a  breadth-first  search  once  after  a  sequence  of  pulses  so  that  the  breadth-first  search 
time  can  be  amortized  over  the  pulse  time.  We  experimented  with  this  approach  as  well  as  with 
using  the  gap-relabel  operation  instead. 

No  theoretical  result  suggests  that  gap-relabel  and  BFS  operations  improve  the  worst-case 
time  bounds  on  the  algorithm.  In  practice,  both  periodic  BFS  and  gap-relabel  operations  reduce 
the  number  of  pulses  drastically.  The  gap-relabel  operation,  however,  is  much  less  expensive 
(usually  faster  than  simple  parallel  relabel  in  our  implementation)  and  as  a  result  gives  better 
running  times  most  of  the  time  despite  of  the  fact  that  the  labeling  after  such  an  operation  is  not 
exact. 

Our  parallel  implementation  can  be  viewed  as  running  the  parallel  push,  relabel,  and  gap- 
relabel  processes  “simultaneously”.  However,  due  to  the  SIMD  nattire  of  the  machine,  we  have  to 
time- multiplex  the  operations.  We  can  adjust  relative  speed  of  the  operations  by  running  several 
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push  operations  followed  by  several  relabel  operations  followed  by  a  gap-relabel  operation.  (Note 
that  it  does  not  make  sense  to  run  two  gap-relabel  operations  without  running  a  simple  relabel 
operation  in  between.)  We  also  can  pipeline  some  of  the  operations  as  described  in  the  next 
section. 

A  pulse  is  a  sequence  of  pushes  and  relabelings  (possibly  pipelined)  that  is  repeated  over  and 
over.  A  general  rule  is  that  the  closer  the  labels  are  to  their  actual  distance  to  the  sink,  the  fewer 
pushes  will  be  performed.  Therefore,  the  more  accurate  labeling  may  reduce  the  total  number  of 
pulses.  However,  maintaining  a  very  accurate  labeling  is  too  expensive.  Our  strategy  is  to  update 
labels  as  accurately  as  possible  without  increasing  the  time  spent  in  each  pulse  by  too  much. 

In  Section  5,  we  describe  our  implementation  in  more  detail. 


3  CM  Architecture  and  Tools. 

In  this  section  we  review  some  aspects  of  the  CM  architecture  relevant  to  our  program.  Then, 
we  describe  a  set  of  useful  operations  available  on  the  Connection  Machine  for  accumulating  in 
parallel  partial  sums,  partial  minimums,  etc. 


3.1  Connection  Machine  Architecture. 


The  following  is  a  brief  outline  of  the  CM  architecture.  For  more  details  see  the  book  by  Hillis 
[13]  and  documents  from  the  Thinking  Machine  Corporation,  for  instance  [4]. 

The  Connection  Machine  is  a  distributed  memory  parallel  computer  [16].  It  consists  of  thou¬ 
sands  of  processors  connected  by  a  routing  network.  Each  processor  has  local  memory.  The  local 
memory  of  a  processor  can  be  accessed  by  other  processors  via  the  routing  network.  The  CM  is 
a  single  instruction,  multiple  data  (SIMD)  machine:  the  program  is  stored  in  a  host  computer 
which  executes  a  sequential  program  containing  parallel  instructions.  When  a  parallel  instruction 
appears  in  the  program,  the  host  broadcasts  it  to  all  processors.  Each  processor,  depending  on  its 
memory  contents,  either  executes  the  instructions  or  remains  idle.  The  operation  of  the  machine 
is  totally  synchronous:  the  next  instruction  does  not  start  until  all  processors  have  completed  the 
execution  of  the  current  instruction. 


Each  processor  can  access  its  own  memory,  or  it  can  access  the  memory  of  other  processors. 
However,  accessing  the  local  memory  is  much  faster  than  accessing  other  processors’  memory.  In 
general,  the  time  required  by  a  processor  to  access  memory  of  a  processor  that  is  ‘‘close”  (in  the 
routing  network)  is  less  than  the  time  required  to  access  the  memory  of  a  processor  that  is  fax. 
If  during  the  execution  of  an  instruction  severed  processors  access  memories  of  other  processors, 
the  longest  memory  access  time  determines  the  execution  time  of  the  instruction.  A  routing  cycle 
is  the  amount  of  time  it  takes  to  execute  such  an  instruction.  It  is  important  to  realize  that  the 
routing  cycle  time  varies  depending  on  the  interprocessor  communication  pattern  and  the  size  of 
data  being  accessed. 


The  Connection  Machine  software  also  provides  the  notion  of  virtual  processors.  The  user 
may  request  any  number  of  processors  he  or  she  wishes.  If  this  number  is  larger  than  the  number 
of  physical  processors,  each  physical  processor  simulates  v  virtual  processors,  where  v  is  the  VP 
ratio,  that  it  .  = 


4 


3.2  Parallel  Prefix  and  SuflSx  Operations. 


Parallel  prefix  operations  have  been  recognized  as  fundamental  parallel  operations,  and  then- 
implementation  and  use  in  parallel  algorithms  has  been  widely  studied;  see  e.g.  [15,  17],  Given 
an  associative  binary  operation  and  a  sequence  of  s  numbers  ai,a2,  •  •  •  ,a,,  the  psirallel  prefix 
operation  maps  this  sequence  into  the  sequence  ai,ai  *  02,  •  •  •  ,ai  *  02  »  •  •  ■  *  a,.  Similarly 
the  parallel  suffix  operation  maps  this  sequence  into  oi  «  02  *  •  •  •  *  a,,  •  •  • ,  a,_i  *  a,,  a,.  In  our 
implementation  of  the  parallel  push-relabel  method,  we  use  prefix  and  suffix  operations  with 
replaced  by  addition,  min,  and  copy. 

An  extended  form  of  parallel  prefix  sind  suffix  operations  are  segmented  parallel  prefix  opera¬ 
tion.  In  the  segmented  form  a  list  of  sequences  5i ,  52 ,  •  •  • ,  5/  is  mapped  into  5j ,  5^ ,  •  •  • ,  5/,  where 
each  sequence  5,-  is  obtained  from  the  corresponding  sequence  5,-  using  the  parallel  prefix  or  suffix 
operation.  The  lengths  of  sequences  Si  may  be  different. 

A  sequence  of  numbers  is  stored  in  the  Connection  Machine  by  placing  the  entries  of  the 
sequence  in  contiguous  processors.  For  simple  operations  such  as  addition,  copying,  and  TniniTnnm 
the  time  required  to  perform  parallel  prefix  and  suffix  operations  is  of  the  same  order  of  magnitude 
as  the  time  required  for  one  routing  cycle  on  the  data  of  the  same  size.  Although  the  routing  cycle 
time  and  the  time  to  perform  a  psuallel  prefix  operation  vary  depending  on  the  communication 
pattern  and  on  the  tjT)e  of  the  parallel  prefix  operation,  we  shall  refer  to  these  times  as  simply 
the  routing  cycle  in  om  description  of  algorithms.  The  meiin  point  to  remember  is  that  a  routing 
cycle  is  much  larger  than  a  local  memory  access. 


4  Data  Structures  and  Implementation  Details. 

4.1  Parallel  Implementation  of  the  Pulse  Procedure. 

We  use  a  mapping  of  the  input  network  to  the  machine  as  first  described  by  Blelloch  in  [2].  This 
mapping  allows  us  to  use  locality  of  data  through  the  parallel  prefix  (suffix)  operations.  Each 
node  V  is  assigned  a  processor  which  we  call  a  node  processor.  Each  arc  (v,  w)  is  also  assigned 
a  processor  called  an  arc  processor.  Recall  that  for  each  arc  we  assume  that  the  eirc 

(it;,  v)  is  also  present  in  the  network  and  therefore,  in  our  implementation,  a  separate  arc  processor 
is  assigned  to  this  arc.  We  call  processors  Ptjw  and  P^  pair  processors.  In  the  machine,  each 
processor  P^,  is  followed  by  all  the  processors  P^  corresponding  to  the  circs  incident  on  it;  the 
positions  of  sxc  processors  associated  with  a  node  are  arbitrary  within  themselves,  and  so  is  the 
positions  of  the  node  processors  as  long  as  the  associated  arc  processors  follow  them.  Each  arc 
processor  P^t^  stores  the  processor  address  of  its  pair  processor  P^. 

The  part  of  the  program  that  reads  in  the  input,  allocates  processors,  and  initializes  the 
system  is  relatively  simple.  The  main  part  consists  of  application  of  the  pulse  procedure  until 
no  active  nodes  remain.  The  implementation  of  this  procedure  is  summarized  in  Figure  4,  This 
implementation  include  the  simple  relabel  operations  as  part  of  it  because  all  of  the  versions 
use  this  operation  predominantly,  although  occasionally  other  relabel  operations  such  as  BFS  or 
gap-relabel  is  used. 

Steps  1-3  and  7  implement  the  parallel  push  procedure.  In  Step  1,  the  value  of  excess  at 
each  node  processor  P^  is  sent  to  the  arc  processors  immediately  following  it.  Step  2  distributes 
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Procedure  pulse 


^  ^  ^^9-pT^^fi^-copy  operation. 

(2)  {  distribute  excess  } 

For  all  use  seg- suffix- add  to  compute  the  amount  that  can  be  pushed  to  lower 
labeled  nodes  through  arcs  that  foDow  (u,  w)  on  the  incident  list  of  v. 

For  aU  compute  the  amount  c(v,w)  to  be  pushed  from  r  to  to 

(3)  {  7us*h  flow  }  ^ 

For  all  Px,yf  do  if  o’(ti,  tz;  >  0  do  begin 

fiv  w)  4-  f{v,w)  ^  a{v,w);  Uf{v,w)  ^  Uf{v,w)  -  a{v,w)] 
send  a  message  containing  c7-(r,tz;)  to  processor 

end. 


For  all  Pxov  that  received  <T(ti,  w'j  do  begin 

If  atmpfr  relabel  IS  the  relabeling  operation  chosen  then  begin 
(^)  {  Compute  new  distance  labels  } 

For  all  do 


(5) 

(6) 


end. 


if  u/(t',m)  >  0  then  head-label{y,w)  *—  d(r)  +  1 
else  kead-label(v,  to)  ♦—  2n. 

For  all  r  e  F  -  {s,t}  compute  Reto-d(t;)  using  seg-suffix-min. 

For  all  r  e  F  -  {£,<}  copy  nev>.d{v)  to  all  using  seg-prefix-covy 
{  Broadcast  new  labels }  ^  s  r  j  yn 

For  all  such  that  n  ^  {s,t}  do 
If  d{v)  new  —  d{v)  then 


send  a  message  containing  the  value  of  d(r)  to 
and  set  d{v)  to  nev)-d{v)  that  was  broadcast. 


(7)  •{  Update  excess  } 

For  all  to  €  F  do  begin 

Use  seg-suffix-add  to  compute  the  amount  of  flow  new-e/(to)  pushed  into  to 
ef{w)*—ef{w)+new-ef{w)  ’ 

end. 


Figure  4:  Implenientation  of  the  pulse  procedure. 
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Operation  Parallel  BPS 

d^{s)  ^  n,  and  d'(t)  0; 

For  all  nodes  v  —  {5,^}  d'(v)  ^  2n; 

Repeat 

Run  steps  (3),  (4)  and  (5)  of  Pulse  procedure. 
Until  n€v>-d{v)  =  d{v)  for  all  nodes  v. 


Figure  5:  Implementation  of  parallel  breadth- first  search  operation. 

the  node  excess  to  the  outgoing  arcs.  First,  each  arc  processor  determines  how  much  excess 
may  be  sent  through  its  arc.  This  is  equal  to  the  residual  capacity  if  d{w)  <  d{v)  (that  is  if  tr  is 
estimated  to  be  closer  to  sink)  and  zero  otherwise.  Then  a  seg-suffix-add  is  performed  on  these 
values.  Now,  each  arc  processor  has  information  about  the  excess  to  be  pushed  from  v,  the 
amount  it  can  push,  and  the  amount  that  can  be  pushed  through  the  arcs  that  follow  (u,  w)  on  the 
arc  list  of  r.  This  information  is  enough  to  compute  the  amount  (t{v,w)  to  be  pushed  through  the 
arc  {y,  w).  After  an  execution  of  the  seg-suffix-add  operation,  each  node  processor  Pn  contains  the 
information  about  how  much  excess  can  be  pushed  from  the  node  v  at  this  pulse.  The  processor 
sets  the  value  of  its  variable  e/(v)  to  the  amount  that  will  remain  after  the  pushing.  Finally  in  step 
3,  all  arc  processors  P„„  for  which  the  amount  tr{v,w)  >  0,  increase/(t>,  ir)  by  <r{v,w),  decrease 
Uf{v,w)  by  the  same  amount,  and  send  the  value  of  <t{v,w)  to  their  pair  processor  Pc*.  Each 
processor  P^v  that  receives  such  a  message  decreases  f{w,v)  by  cr{v,w)  and  increases  Uf{w,v)  by 
the  same  amount.  Step  7  computes  the  new  excesses  on  each  node  by  performing  a  seg-suffix-add 
on  the  amount  of  new  flow  pushed  through  each  arc,  and  this  amount  is  added  to  ey(v). 

Steps  4-6  implement  the  simple  relabel  operation.  In  Step  4,  each  processor  sets  its 
variable  head-label  to  either  d{w)  + 1  or  2n,  depending  on  whether  the  arc  (r , «;)  is  residual  or  not. 
This  process  involves  only  local  memory  access.  Next,  a  seg-suffdx-min  operation  is  performed  on 
the  head-label  variable,  and  as  a  result  each  node  processor  P„  contains  new  value  of  d{v).  In  Step 
5  all  node  processors  except  P,  and  Pt  copy  this  value  to  their  corresponding  arc  processors  using 
seg-preftx-copy.  In  Step  6  each  arc  processor  P^  checks  if  the  new  d(v)  is  different  from  the  old 
one  and  if  so,  sends  a  message  to  its  pair  processor  updating  d{v)  in  the  pair  processor. 

Each  step  1  through  7  contains  either  a  segmented  parallel  prefix  (suffix),  or  a  commumcation 
primitive,  and  the  running  time  of  each  step  is  donainated  by  the  primitive.  Therefore,  the  overall 
running  time  of  the  pulse  procedure  is  roughly  seven  routing  cycles  of  the  machine.  Also  observe 
that  general  communications  are  done  along  paths  that  are  fixed  through  entire  program:  each 
processor  Pmo  has  to  communicate  to  processor  P^n  in  steps  3  and  6  and  these  are  the  only  general 
communication  operations.;  Therefore,  the  communication  path  has  to  be  computed  only  once  for 
each  arc  processor  and  the  same  information  is  used  throughout  the  program. 

To  implement  the  parallel  BFS  operation  we  simply  take  steps  4-6  of  pulse  and  run  them 
over  and  over  until  the  labels  do  not  change.  The  number  of  times  the  simple  relabel  is  iterated 
in  a  BFS  operation  is  at  most  the  larger  of  maximum  distance  of  a  node  to  the  sink  (if  sink  is 
reachable)  and  TnaTirmnn  distance  of  a  node  to  the  source  (if  sink  is  not  reachable)  in  the  residual 
graph. 

The  gap-relabel  procedure  is  also  easy  to  implement  on  the  Connection  machine;  see  Figure 
6.  Clearly  this  procedure  is  not  much  more  costly  than  a  simple  relabel  operation  (roughly  four 
routing  cycles  vs.  three).  However,  it  may  increase  the  labels  of  many  nodes  by  a  substantial 
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Operation  Parallel  gap-relabel 

(1)  For  each  node  whose  label  ci(T;)  satisfies  d{v)  <  n  in  parallel  do: 

Broadcast  a  flag  to  the  processor  numbered  d(i;); 

(2)  Find  the  smallest  g  where  processor  g  did  not  receive  any  message  in  step  1). 

(3)  For  all  nodes  v  in  parallel  do: 

if  g  <  d{v)  <  n  set  d{v)  ^  n. 

(4)  For  all  v  with  new  labels  copy  new~d{v)  to  all  using  s eg -prefix- copy. 

(5)  For  all  which  received  new  labels  send  a  message  to  P^^  containing  n€w-d{v). 

Figure  6:  Implementation  of  the  parallel  gap-relabel  operation. 


amount. 

We  have  experimented  with  several  variants  of  using  gap-relabel  and  BFS  operations.  In  all 
of  the  variants  each  pulse  uses  pushes  and  simple  relabels,  but  certain  times  instead  of  the  simple 
relabel  a  BFS  or  gap-relabel  operation  is  used. 

When  using  BFS  we  follow  the  following  rule.  We  save  the  amount  of  computational  work 
done  in  the  last  call  to  BFS.  Then  we  accumulate  the  amount  of  work  done  by  the  simple  relabel 
since  the  last  call  to  BFS.  We  also  fix  a  parameter  k.  If  k  times  the  amount  of  work  since  the  last 
call  to  BFS  exceeds  the  amoimt  of  work  in  the  last  BFS  then  we  use  BFS,  otherwise  we  use  simple 
relabeling.  The  amount  of  work  itself  can  be  measured  in  several  ways.  One  way  is  to  simply 
look  at  the  CPU  time  used.  Another  way  is  to  count  the  number  of  “expensive”  operations, 
in  this  case  the  number  of  routing  cycles.  Each  simple  relabel  contributes  three  routing  cycles 
(one  parallel  suffix  copy,  one  parallel  prefix  min,  and  one  general  communication  step),  and  the 
work  accumulated  by  the  simple  relabeling  procedure  is  simply  three  times  the  number  of  pulses 
since  last  call  to  BFS.  The  amount  of  work  in  each  BFS  varies  as  the  residual  graph  changes 
with  each  new  prefiow.  We  havensed  this  technique  for  both  push-relabel  and  push-push-relabel 
methods.  The  latter  is  a  variation  of  a  the  push-relabel  method  where  we  only  relabel  every  other 
pulse. ^  (One  may  think  of  this  method  as  choosing  the  relabeling  operation  that  does  nothing, 
and  alternate  using  this  operation  with  simple  relabeling.)  We  also  tested  this  technique  with  the 
pipelined  Vciricuits  of  push— relabel  and  push— push-relabel  techniques  (to  be  discussed  in  the  next 
section.) 

Another  approach  is  to  use  push  and  relabel  operations  but  after  each  relabel  to  apply  a 
gap-relabel  operation  as  well.  Our  experiments  show  that  one  does  not  need  to  apply  gap-relabel 
every  time.  We  fix  a  parameter  k  and  call  gap-relabel  after  every  k  pulses.  Again,  we  tested 
gap-relabel  with  both  push— relabel  euid  push— push— relabel  variants  of  pulse  and  their  pipeHned 
versions.  AU  of  the  timings  are  reported  in  Section  5. 


4.2  Pipelining  Independent  Operations. 

On  the  Connection  Machine,  if  two  segmented  parallel  prefix  or  suffix  operations  work  on  exactly 
the  same  sequences  and  perform  the  same  binary  operations,  it  is  possible  to  pipeline  them  so  that 
the  pipelined  operation,  performed  on  the  two  sequences  at  once,  is  faster  than  two  operations, 
each  performed  on  one  of  the  sequences  at  a  time.  For  instance,  steps  1  and  5  of  the  pulse  could 

^In  general,  a  pulse  can  have  x  push  operations  followed  by  y  relabel  operations. 
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be  pipelined,  and  so  could  steps  2  and  7.  Steps  3  and  6  also  do  the  same  kind  of  operation, 
except  that  they  involve  message  routing  instructions.  In  theory  we  should  be  able  to  get  better 
performance  by  pipelining  the  message  routing  steps,  but  on  the  Connection  Machine  we  have 
not  seen  significant  improvement.  Pipelining  the  segmented  prefix  operations,  however,  results  in 
about  10  to  20  percent  improvement. 

There  cire  seven  routing  cycles  in  the  pulse  procedure  (including  cycles  in  simple  relabeling). 
Some  steps  need  to  be  performed  after  others  (for  instance.  Step  6  must  follow  Step  5,  and  Step 
5  must  follow  Step  4),  wherecis  others  are  independent  of  each  other  (for  example,  steps  1  and  2 
do  not  depend  on  each  other  and  either  may  follow  the  other.)  Figure  7  shows  the  dependency 
relationship  among  these  steps. 

The  problem  with  pipelining  steps  1  and  5,  2  and  7,  and  3  and  6  is  that  they  are  sequentially 
related  in  the  dependency  graph.  In  order  to  curb  this  problem  we  take  advantage  of  the  robustness 
of  the  method.  The  dependency  of  Step  4  on  Step  3  exists  so  that  in  the  relabel  step  we  have 
the  updated  residual  capacities.  Also,  the  dependency  of  Step  2  on  Step  6  (of  the  previous 
pulse)  exists  so  that  the  push  operation  is  done  based  on  new  labels.  These  dependencies  may 
be  loosened  somewhat  so  that  steps  1  and  5,  steps  2  and  7,  and  steps  3  and  6  can  be  pipelined. 
The  disadvantage  is  that  now  the  labeling  becomes  less  accurate,  and  this  translates  into  more 
iterations  of  the  pulse  procedure.  The  advantage  is  that  now  each  pulse  has  only  four  routing 
cycles,  three  of  which  are  pipelined  (and  thus  operate  on  longer  data).  See  Figure  8. 

The  question  is  whether  the  time  saved  in  each  pulse  more  than  compensates  the  time  lost 
due  to  increased  number  of  pulses.  We  report  on  the  experiments  in  the*  next  section. 

We  also  used  pipelining  on  the  push-push-relabel  implementation.  (Recall  that  in  this  imple¬ 
mentation  we  call  the  relabel  operation  only  every  other  pulse.)  The  dependency  graph  for  this 
version  of  the  algorithm  is  unfolded  in  Figure  9.  Notice  that  every  stage  in  Figure  9  is  equivalent 
to  two  pulses,  only  one  of  which  has  a  relabel  stage.  Thus  in  this  form  the  total  number  of  routing 
cycles  for  two  pulses  is  seven,  whereas  in  Figure  8  there  are  eight  routing  cycles  per  two  pulses. 


5  Experimental  Results. 

In  this  section  we  report  on  the  running  times  of  our  program  on  several  classes  of  mediiun 
and  large  networks.  These  experiments  were  conducted  on  two  similar  Connection  Machines, 
one  located  at  the  Thinking  Machine  Corporation  in  Cambridge,  Massachusetts,  and  the  other 
one  at  the  Army  High-Performance  Computing  Research  Center  (AHPCRC)  at  the  University 
of  Minnesota  in  Minneapolis.  Both  machines  have  32K  processors  with  1  Gigabytes  of  memory. 
The  timings  reported  here  are  based  on  test  runs  on  the  machine  in  AHPCRC.  The  program 
was  coded  using  the  new  C*  programming  language,  an  extension  of  standard  C  for  data  parallel 
programming. 

Finding  a  “fair”  set  of  input  networks  to  test  the  parallel  push-relabel  method  is  not  an  easy 
task.  One  can  easily  find  instances  that  cause  the  program  run  very  slowly.  For  instance,  since 
the  number  of  pulses  is  at  least  as  large  as  the  length  of  the  shortest  path  from  s  to  t,  the  program 
will  perform  poorly  on  any  graph  with  large  s  —  f  distance.  A  simple  path  of  size  64K  will  require 
64K  pulses,  each  taking  several  routing  cycle  of  the  Connection  Machine,  which  is  large  compared 
to  memory  access  time  of  a  typical  sequential  computer.  In  fact,  in  this  degenerate  case  only  one 
node  is  active  at  a  time,  and  our  implementation  exhibits  no  parallelism.  On  the  other  hand,  if 
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Figure  7:  The  dependency  graph  in  the  pulse  procedure. 
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pulse  3  >■ 

Figure  8:  Unfolded  dependency  graph  of  the  pulse  procediire  with  pipelined  steps. 
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Figure  9:  Unfolded  dependency  graph  of  push-push-reiabel  procedure  with  pipelined  steps. 
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a  graph  consists  of  64K  parallel  arcs  connecting  the  source  to  the  sink,  our  implementation  will 
terminate  in  just  one  pulse. 

One  class  of  networks  we  used  is  generated  as  follows  [8]:  imagine  an  infinite  pipe  with  a  mesh 
drawn  on  it.  Suppose  the  pipe  goes  from  west  to  east.  The  distance  from  a  node  of  the  mesh  to 
the  nearest  neighbor  in  both  horizontal  and  vertical  directions  is  one,  and  the  circumference  of 
the  pipe  is  D.  First  we  construct  a  graph  on  the  nodes  of  the  mesh.  In  this  graph,  every  node 
has  out-degree  26 x  -i-  25y.  To  construct  the  graph,  we  connect  each  node  v  by  a  directed  arc 
to  all  nodes  within  6x  due  east  and  west,  and  within  Sy  due  north  and  south.  The  capacity  of 
an  arc  (u,  w)  depends  on  the  distance  x  between  v  and  x  in  the  mesh.  The  capacity  is  selected 
from  a  uniform  distribution  on  the  interval  [0,min(1.8®,  10000)].  To  complete  the  construction, 
we  introduce  a  source  s  and  a  sink  t.  Then  we  cut  the  pipe  by  two  planes  perpendicular  to  the 
axis  of  the  pipe.  The  cutting  planes  are  distance  i  apart.  We  consider  the  portion  of  the  pipe 
that  is  between  the  two  planes,  and  identify  all  nodes  to  the  west  of  the  west  plane  into  source 
5  and  all  nodes  on  the  east  of  the  east  plane  into  sink  t.  All  the  arcs  cut  by  the  left  plane  are 
connected  to  the  source,  and  those  cut  by  the  right  plane  are  connected  to  the  sink.  In  Figure  10 
we  list  the  node  and  arc  sizes  of  mesh-on~pipe  graphs  we  experimented  with. 
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Figure  ICh  Characteristics  of  mesh-on~pipe  graphs. 


The  size  parameter  is  larger  than  sum  of  nodes  and  arcs  because  for  each  arc  (v,  w)  we  had  to 
create  the  pair  arc  (w,  v).  The  task  of  reading  the  input  is  done  by  the  front  end  machine  (which 
is  a  SUN  4  system),  and  involves  reading  one  line  of  input  at  a  time,  allocating  a  processor  for 
the  arc  that  was  read  and  its  pair,  and  sending  the  information  (head  and  tail  of  the  arc  and  its 
capacity)  to  the  allocated  processors. 
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Figure  11;  Graphs  generated  by  the  “gennnf”  generator. 

W’e  also  experimented  with  a  number  of  other  networks  provided  for  the  DIMACS  Challenge, 
namely  the  ones  generated  by  the  “genrmT’  generator,  fuHy  dense  acyclic  networks,  and  some 
moderately  sparse  grids.  In  figures  11,  12,  and  13  we  Ust  sizes  of  the  networks  that  we  used  in 

our  experiments. 

For  each  of  these  networks  we  have  run  four  basic  algorithms:  push-relabel  without  pipelining, 
push-relabel  with  pipelining,  push-push-relabel  without  pipelining  and  push-push-relabel  with 
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Figure  12:  Complete  acyclic  networks . 
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Figure  13:  Moderately  sparse  Basic  Line  Mesh  networks. 

pipelining.  In  all  cases  the  simple  relabel  is  used  in  the  pulse  procedure,  except  at  certain  pulses 
gap-relabel  or  BFS  operation  was  is  used  instead,  as  discussed  in  the  previous  section.  Gap-relabel 
w^as  used  every  other  k  pulses  for  some  parameter  k.  Although  the  best  parameter  depends  on 
the  structure  of  the  input  graph,  we  fixed  k  —  10,  so  that  the  restdts  are  comparable.  For  BFS, 
as  was  mentioned  earlier,  we  compare  the  amount  of  “work”  (in  our  case  the  number  of  routing 
cycles)  done  by  the  last  BFS,  and  if  it  exceeds  k  times  the  accumulated  “work”  done  by  simple 
relabels  then  we  use  BFS.  Again  the  optimal  value  of  k  varies  depending  on  the  structure  of  the 
graph;  in  our  .experiments  we  used  i  =  2  throughout.  We  have  observed  that  the  versions  using 
gap-relabel  procedure  almost  universally  outperform  those  which  use  BFS.  Figure  14  shows  the 
computational  results  when  gap-relabel  procedure  is  used.  Note  that  for  large  enough  problems, 
our  algorithm  would  run  almost  twice  as  fast  on  a  32K  processors  due  to  a  lower  VP  ratio,  (Recall 
that  our  data  is  for  16K  processors.) 

16. 

In  order  to  demonstrate  the  effect  of  changing  the  parameter  k  for  both  algorithms  that  use 
gap-relabel,  and  those  that  use  BFS,  we  report  the  behavior  of  two  algorithms  on  the  network 
rmfl.  The  results  for  push-push-relabel  without  pipelining,  with  BFS  and  k  changing  from  2.0  to 
3^0  in  increments  of  0.2  are  shown  in  figure  15,  while  those  of  push-relabel,  without  pipelining,  with 
gap-relabel  and  for  k  changing  from  10  to  14  in  increments  of  1,  on  the  same  network  are  shown 
in  Figure  16.  Generally,  we  have  observed  that  for  7  <  <  15  for  gap-relabel  and  1.5  <  i  <  3.0 

for  BFS  all  variants  perform  well.  However,  within  these  ranges  there  is  no  unique  best  value  of 
k  and  one  may  see  oscillations  in  the  total  number  of  pulses  and  in  the  CPU  running  time  as 
k  chcuiges.  This  phenomenon  may  be  attributed  to  the  fact  that  it  is  not  just  the  frequency  of 
BFS  or  gap-relabel  operations  that  affects  the  total  number  of  pulses,  but  the  instance  that  they 
are  used  is  also  important.  The  effect  of  calling  such  an  operation  is  influenced  by  the  current 
structure  of  the  residual  graph.  It  is  not  clear  how  to  determine  the  most  appropriate  moment  to 
apply  the  accurate  relabeling  without  doing  a  lot  of  time-consuming  operations. 

6  Conclusions 

We  described  several  peiraJlel  implementations  of  the  ptish-relabel  method  for  the  maximum  flow 
problem  and  evaluated  several  heuristics  that  improve  the  practical  performance  of  the  method. 
Our  results  are  impressive  on  some  classes  of  problems.  They  also  provide  feedback  for  the 
designers  of  parallel  computers. 
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Figure  14:  number  of  iterations  of  pulse  operation  and  running  time  (in  seconds)  of  pipelined 
and  unpipelined  push-relabel  procedure  with  gap-relabel  used  every  10  pulses.  For  all  runs  16K 
processors  were  used. 
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Figure  15:  Effect  of  push-push-relabel  without  pipelining  and  using  BFS  on  network  rmfl. 
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